An HMM-based Method for Segmenting Japanese Terms and Keywords based on Domain-Specific Bilingual Corpora

Authors

  • Keita Tsuji
  • Kyo Kageura
Abstract

Much effort has been devoted to Japanese morphological analysis (Matsumoto et al. 1996; Nagata 1994; Papageorgiou 1994; Yamamoto & Masuyama 1997). The general performance of many of these systems is quite good. For the purpose of processing complex terms and keywords, however, they have some problems and limitations. Firstly, most general-purpose morphological analysers do not segment complex words consistently. There are a few morphological analysers (e.g. Takeda & Fujisaki 1987) which are specialised for the analysis of Chinese-origin units, but they cannot easily be extended to other units such as Katakana. In addition, in the IR-oriented applications which we have in mind, the performance of systems with pre-defined dictionaries tends to be much lower than their average performance, because of the high domain-specificity. A proper method for segmenting Japanese complex lexical units is therefore needed as part of extracting translation patterns of terms.
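The excerpt does not spell out the paper's HMM itself, but the core operation it motivates, segmenting a complex lexical unit by dynamic programming over candidate sub-units, can be illustrated with a minimal sketch. The following Python example uses a toy unigram word model and Viterbi-style search; the lexicon entries and probabilities are invented for illustration and are not from the paper:

```python
import math

# Hypothetical unigram lexicon (word -> probability); the paper's actual
# model is an HMM estimated from a domain-specific bilingual corpus.
LEXICON = {
    "data": 0.4,
    "base": 0.3,
    "database": 0.2,
    "s": 0.1,
}

def segment(text):
    """Return the best segmentation of `text` under a unigram word model,
    found by Viterbi-style dynamic programming, or None if no segmentation
    covers the whole string."""
    n = len(text)
    # best[i] = (log-probability of best segmentation of text[:i], backpointer)
    best = [(-math.inf, -1)] * (n + 1)
    best[0] = (0.0, -1)
    for i in range(1, n + 1):
        for j in range(i):
            word = text[j:i]
            if word in LEXICON and best[j][0] > -math.inf:
                score = best[j][0] + math.log(LEXICON[word])
                if score > best[i][0]:
                    best[i] = (score, j)
    if best[n][0] == -math.inf:
        return None
    # Recover the segmentation by following backpointers from the end.
    words, i = [], n
    while i > 0:
        j = best[i][1]
        words.append(text[j:i])
        i = j
    return list(reversed(words))

print(segment("databases"))  # -> ['database', 's']
```

Here "database" + "s" wins over "data" + "base" + "s" because it uses fewer factors, each penalised by a log-probability; a full HMM would additionally score transitions between unit types (e.g. Chinese-origin vs. Katakana units), which this unigram sketch omits.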


Similar resources

Disambiguation of Lexical Translations Based on Bilingual Comparable Corpora

Bilingual dictionaries in machine-readable form are important and indispensable information resources for cross-language information retrieval (CLIR), machine translation (MT), and so on. Specific academic areas or technology fields become the focus of these cross-language informational activities. In this paper, we describe a bilingual dictionary acquisition system which extracts translations from ...


Bilingual lexicon extraction from comparable corpora using in-domain terms

Many existing methods for bilingual lexicon learning from comparable corpora are based on similarity of context vectors. These methods suffer from noisy vectors that greatly affect their accuracy. We introduce a method for filtering this noise allowing highly accurate learning of bilingual lexicons. Our method is based on the notion of in-domain terms which can be thought of as the most importa...


Speech enhancement based on hidden Markov model using sparse code shrinkage

This paper presents a new hidden Markov model-based (HMM-based) speech enhancement framework based on independent component analysis (ICA). We propose analytical procedures for training clean speech and noise models by the Baum re-estimation algorithm and present a maximum a posteriori (MAP) estimator based on a Laplace-Gaussian (for clean speech and noise, respectively) combination in the HMM ...


Automatic Construction of a Japanese-Chinese Dictionary via English

This paper proposes a method of constructing a dictionary for a pair of languages from bilingual dictionaries between each of the languages and a third language. Such a method would be useful for language pairs for which wide-coverage bilingual dictionaries are not available, but it suffers from spurious translations caused by the ambiguity of intermediary third-language words. To eliminate spu...


Compiling French-Japanese Terminologies from the Web

We propose a method for compiling bilingual terminologies of multi-word terms (MWTs) for given translation pairs of seed terms. Traditional methods for bilingual terminology compilation exploit parallel texts, while more recent ones have focused on comparable corpora. We use bilingual corpora collected from the web and tailor-made for the seed terms. For each language, we extract from the c...



Journal:

Volume   Issue 

Pages  -

Publication date: 1997